#this generative ai dataset nonsense is the same thing but instead of wages it's...royalties. i suppose. residuals.

nientedal · 10 months ago


I appreciate the input, and I understand where you're coming from and already agree with some of what you're saying! CommonCrawl's sets exist for public use for exactly the kind of analysis you describe, this is a good thing, yes. Fully agree with you there and have never disagreed with that. The part where I get lost is...correct me if I'm wrong, but to your way of thinking, the datasets are vast enough to dilute any chance of regurgitated phrases or direct plagiarism, so no harm, no foul? Nobody's livelihood is threatened, therefore all is well?

Assuming that is more or less where you're coming from...again, I hear you, but I have trouble engaging with that stance because it doesn't actually address my problem. I'm not worried about regurgitation. My problem is not with the output at all.

My problem is with scraped data being used for massive profit on a massive scale with no permission or compensation. Full stop, that is where "is this ethical" begins and ends for me. Is it ethical to use someone else's work to generate billions-- not an exaggeration-- of dollars of profit? No. What if it's ten million "someone else"s, is it ethical then? No. What if it's JSTOR developing their own tool, relying on the tool that uses the unpaid work of ten million someone elses-- is that ethical? No. Can that be ethical? No. It can be used to do good things, but it cannot be used ethically; these statements are not mutually exclusive.

From where I'm sitting, the size of the dataset or dilution of any one piece in relation to the whole is not relevant except to indicate how many people have been exploited to develop these tools. I used sand as my analogy for a reason-- it's easy to look at the sandbox of generative AI and say, "no single grain meaningfully influenced the building of that castle. The amount any particular grain contributed to the whole is minimal." No one is hurt when the robot builds a sand castle, so who cares about the individual grains?

Me. I do. The castle could not exist without those individual grains, every single one of which took a human person some amount of time to make (time, and education, and practice, and labor, and thought, and energy; we're talking hours and days and years of work) and every single one of which is being used to generate enormous profit without permission or compensation.

That's my problem. You may not agree that this is a reasonable concern, and that's okay! We'll agree to disagree.

I'll address fair use under the cut, because I think I may not have been super clear on what I meant about that, and trying to explain it got a little long. It doesn't change anything up here, though, so if you wanna skip it that's totally cool. (And yes, let's assume we're talking exclusively about text-based stuff lol, image stuff is a topic for another post. My stance is the same, though.) Anyway, "fair use" in this context refers to a legal doctrine, not a moral judgment.

When I say there are fair use problems with generative AI, I mean that from a legal perspective. You may already have known that, I don't know-- you disagree that there are problems under fair use, but...your post doesn't really discuss fair use at all? Legally? You do sort of touch on one of the factors, the fourth one, and to be clear, it's a solid argument. Another argument would be that use of copyrighted materials in developing and training generative AI is transformative. That's up for debate, but it is an argument I've seen and I understand the reasoning behind it. I also understand why we wouldn't want it to fall under scrutiny.

But there are also arguments against fair use here, enough that several copyright lawsuits to that effect have already been brought against Microsoft and OpenAI and I think a couple of other corporations. (Disclaimer-- I'm an accountant, not a lawyer. What I'm saying is effectively recapping what I've read previously from actual lawyers, and I'm googling as I go to make sure I am not flat-out wrong on the face of this, lol.)

In evaluating a claim under the fair use doctrine, courts typically look at four factors:

1. Purpose and character of the use, including whether the use is for profit,
2. Nature of the copyrighted work,
3. Amount and substantiality of the portion used in relation to the copyrighted work as a whole, and
4. Effect of the use upon the potential market for or value of the copyrighted work.

Currently, I believe the defense of AI (and your stance, I think?) has mostly been riding on that last one. No chance of plagiarism means no effect on the market value of the original works! They're diluted beyond recognition! That's points in AI's favor.

But the third point up there is basically asking, "how much of the copyrighted material was used to create the work claimed to be protected under fair use?" and this one is the reason fanartists are, by and large, able to make some money on their fanworks while fanauthors really are not. A drawing is a still image, so it "uses" only small pieces of the original work overall in its creation; a written story, on the other hand, can be (and has been) argued to have "used" a significant portion of the original work. If I paint fanart of something for...idk, Supernatural or some other long-running show and sell it, well, I didn't use a substantial amount of the show to create the art. It's a still image; in the context of the show it'd be a single frame among millions. But if I write a 500,000 word fanfic that draws on multiple characters and events and plot points from multiple seasons...that's a lot more of the source material! If I sell that, I'm way more likely to get sued than if I painted something.

So-- amount of source material used in comparison to the whole of the source material and profit generated are both problems under fair use. Here again is the core of my argument as to why the current setup is inherently, inescapably unethical.

When it comes to data scraping, the original works in their entirety have been used. And they are being used to generate enormous profit. Microsoft gave ten billion dollars to OpenAI last year, that is not insignificant. Profit and substantiality are problems under the fair use doctrine, and-- again-- enough lawyers have agreed with that statement to take multiple cases to court over this. So far, the courts have not ruled in their favor and I can see why, but my point is simply that this is a fair use issue! We don't have to agree one way or the other on what bits are more or less important-- I'm just explaining why I said what I did and why I do stand by it. Yes, there are arguments to be made in either direction, but if you are familiar with fair use, you will see issues here.

But ultimately, fair use isn't really part of my argument. More just an aside. Maybe generative AI is perfectly defensible on all counts under fair use and I've just got my head up my ass, it's whatever. I'm interested to see how the various cases play out. Either way, even if generative AI is 100% defensible under the fair use doctrine, I do not agree that its use in its current setup is ethical.

If you've made it this far, kudos, and thank you for listening. Again, I absolutely do see your point, and I'm sorry, but I disagree. Theft for profit cannot be diluted to a point where it can be called ethical.

Why is JSTOR using AI? AI is deeply environmentally harmful and steals from creatives and academics.

Thanks for your question. We recognize the potential harm that AI can pose to the environment, creatives, and academics. We also recognize that AI tools, beyond our own, are emerging at a rapid rate inside and outside of academia.

We're committed to leveraging AI responsibly and ethically, ensuring it enhances, rather than replaces, human effort in research and education. Our use of AI aims to provide credible, scholarly support to our users, helping them engage more effectively with complex content. At this point, our tool isn't designed to rework content belonging to creatives and academics. It's designed to allow researchers to ask direct questions and deepen their understanding of complex texts.

Our approach here is a cautious one, mindful of ethical and environmental concerns, and we're dedicated to ongoing dialogue with our community to ensure our AI initiatives align with our core values and the needs of our users. Engagement and insight from the community, positive or negative, helps us learn how we might improve our approach. In this way, we hope to lead by example for responsible AI use.

For more details, please see our Generative AI FAQ.

#i am well aware that the logical end point of my problem is ''this technology should not exist in its current state at all'' #and i'm well aware that mine is not a popular stance #but i say this as someone who works with a lot of small businesses (''small'' meaning under $25MM/yr): if your business cannot afford #to pay its employees & contractors living wages #then your business is a failure. you have failed. if the only way you make profit is by exploiting and undervaluing others' work #then your profit is stolen wages #this generative ai dataset nonsense is the same thing but instead of wages it's...royalties. i suppose. residuals. #i don't think there's a fully accurate term for it yet; the law has not caught up #my point is: i cannot claim to support everyone's right to receive the fair value of their labor #and then turn around and cheerfully ask a robot to build me a sandcastle out of stolen fucking labor #that does not fucking follow. i am sorry but those are incompatible stances. #i am not normally this inflexible #but the only way this follows is if you believe art (including written art) is not actually work with any value #in which case #i'm going to break into your home and take an enormous shit in the vegetable drawer of your refrigerator #but also you are factually wrong - it is valuable work - as proven by OpenAI's bottom fucking line #currently built on massive art theft #long post #and yes i am aware of OpenAI Global's corporate structure #it does not actually change my stance #frankly even if they were still a nonprofit-- which now they are a for-profit subsidiary of their parent non-profit (gee i wonder why) #(just kidding i don't have to wonder)-- even if they were still a nonprofit i'd have the same problem #nonprofits still generate profit; the difference is they can't distribute those profits to shareholders #but they can pay them to their employees and executives (: #ai bs

122 notes